MP(3F)

NAME
mp: mp_block, mp_blocktime, mp_create, mp_destroy, mp_my_threadnum, mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock, mp_suggested_numthreads, mp_unsetlock, mp_barrier, mp_in_doacross_loop, mp_set_slave_stacksize - FORTRAN multiprocessing utility routines

SYNOPSIS
     subroutine mp_block()

     subroutine mp_unblock()

     subroutine mp_blocktime(iters)
     integer iters

     subroutine mp_setup()

     subroutine mp_create(num)
     integer num

     subroutine mp_destroy()

     integer function mp_numthreads()

     subroutine mp_set_numthreads(num)
     integer num

     integer function mp_my_threadnum()

     integer function mp_is_master()

     subroutine mp_setlock()

     integer function mp_suggested_numthreads(num)
     integer num

     subroutine mp_unsetlock()

     subroutine mp_barrier()

     logical function mp_in_doacross_loop()

     subroutine mp_set_slave_stacksize(size)
     integer size

DESCRIPTION
These routines give some measure of control over the parallelism used in FORTRAN jobs.  They should not be needed by most users, but will help to tune specific applications.

mp_block puts all slave threads to sleep via blockproc(2).  This frees the processors for use by other jobs.  This is useful if it is known that the slaves will not be needed for some time, and the machine is being shared by several users.  Calls to mp_block may not be nested; a warning is issued if an attempt to do so is made.

mp_unblock wakes up the slave threads that were previously blocked via mp_block.  It is an error to unblock threads that are not currently blocked; a warning is issued if an attempt is made to do so.  It is not necessary to explicitly call mp_unblock.  When a FORTRAN parallel region is entered, a check is made, and if the slaves are currently blocked, a call is made to mp_unblock automatically.
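For example, a program that alternates between parallel phases and a long serial phase might bracket the serial phase as follows.  This is only a minimal sketch; write_checkpoint is a hypothetical user routine, not part of this interface.

     c     Sketch only: release the processors during a long serial phase.
     c     write_checkpoint is a hypothetical user routine.
           call mp_block()
           call write_checkpoint()
           call mp_unblock()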
mp_blocktime controls the amount of time a slave thread waits for work before giving up.  When enough time has elapsed, the slave thread blocks itself.  This automatic blocking is independent of the user level blocking provided by the mp_block/mp_unblock calls.  Slave threads that have blocked themselves will be automatically unblocked upon entering a parallel region.  The argument to mp_blocktime is the number of times to spin in the wait loop.  By default, it is set to 10,000,000.  This takes about .25 seconds on a 200MHz processor.  As a special case, an argument of 0 disables the automatic blocking, and the slaves will spin wait without limit.  The environment variable MP_BLOCKTIME may be set to an integer value.  It acts like an implicit call to mp_blocktime during program startup.

mp_destroy deletes the slave threads.  They are stopped by forcing them to call exit(2).  In general, doing this is discouraged.  mp_block can be used in most cases.

mp_create creates and initializes threads.  It creates enough threads so that the total number is equal to the argument.  Since the calling thread already counts as one, mp_create will create one less than its argument in new slave threads.

mp_setup also creates and initializes threads.  It takes no arguments.  It simply calls mp_create using the current default number of threads.  Unless otherwise specified, the default number is equal to the number of cpus currently on the machine, or 8, whichever is less.  If the user has not called either of the thread creation routines already, then mp_setup is invoked automatically when the first parallel region is entered.  If the environment variable MP_SETUP is set, then mp_setup is called during FORTRAN initialization, before any user code is executed.

mp_numthreads returns the number of threads that would participate in an immediately following parallel region.  If the threads have already been created, then it returns the current number of threads.  If the threads have not been created, then it returns the current default number of threads.  The count includes the master thread.  Knowing this count can be useful in optimizing certain kinds of parallel loops by hand, but this function has the side-effect of freezing the number of threads to the returned value.  As a result, this routine should be used sparingly.  To determine the number of threads without this side-effect, see the description of mp_suggested_numthreads below.

mp_set_numthreads sets the current default number of threads to the specified value.  Note that this call does not directly create the threads; it only specifies the number that a subsequent mp_setup call should use.  If the environment variable MP_SET_NUMTHREADS is set, it acts like an implicit call to mp_set_numthreads during program startup.  For convenience when operating among several machines with different numbers of cpus, MP_SET_NUMTHREADS may be set to an expression involving integer literals, the binary operators + and -, the binary functions min and max, and the special symbolic value ALL, which stands for "the total number of available cpus on the current machine."  Thus, something simple like

     setenv MP_SET_NUMTHREADS 7

would set the number of threads to seven.
This may be a fine choice on an 8 cpu machine, but would be very bad on a 4 cpu machine.  Instead, use something like

     setenv MP_SET_NUMTHREADS "max(1,all-1)"

which sets the number of threads to be one less than the number of cpus on the current machine (but always at least one).  If your configuration includes some machines with large numbers of cpus, setting an upper bound is a good idea.  Something like

     setenv MP_SET_NUMTHREADS "min(all,4)"

will request (no more than) 4 cpus.  For compatibility with earlier releases, NUM_THREADS is supported as a synonym for MP_SET_NUMTHREADS.

mp_my_threadnum returns an integer between 0 and n-1, where n is the value returned by mp_numthreads.  The master process is always thread 0.  This is occasionally useful for optimizing certain kinds of loops by hand.

mp_is_master returns 1 if called by the master process, 0 otherwise.

mp_setlock provides convenient (though limited) access to the locking routines.  The convenience is that no set up need be done; it may be called directly without any preliminaries.  The limitation is that there is only one lock.  It is analogous to the ussetlock(3P) routine, but it takes no arguments and does not return a value.  This is useful for serializing access to shared variables (e.g. counters) in a parallel region.  Note that it will frequently be necessary to declare those variables as VOLATILE to ensure that the optimizer does not assign them to a register.

mp_suggested_numthreads uses the supplied value as a hint about how many threads to use in subsequent parallel regions, and returns the previous value of the number of threads to be employed in parallel regions.  It does not affect currently executing parallel regions, if any.  The implementation may ignore this hint depending on factors such as overall system load.  This routine may also be called with the value 0, in which case it simply returns the number of threads to be employed in parallel regions without the side-effect present in mp_numthreads.
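The fragment below is a minimal sketch of how these routines combine; the thread count, the loop body, and the form of the VOLATILE statement (a compiler extension, as noted above) are assumptions for illustration, not taken from this page.  It requests four threads, creates them, queries the count without freezing it via mp_suggested_numthreads(0), and serializes updates to a shared counter with the single global lock and its companion mp_unsetlock (described next); the C$DOACROSS directive is described under Directives below.

           integer count, i, n
           integer mp_suggested_numthreads
     c     Assumed VOLATILE extension: keep the counter out of a register.
           volatile count
           call mp_set_numthreads(4)
           call mp_setup()
           n = mp_suggested_numthreads(0)
           count = 0
     c$doacross local(i), share(count)
           do i = 1, 1000
              call mp_setlock()
              count = count + 1
              call mp_unsetlock()
           enddo
           print *, 'threads =', n, '  count =', count
           end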
By coding a few simple directives, the compiler splits the job into concurrently executing pieces, thereby decreasing the wall-clock run time of the job. Directives enable, disable, or modify a feature of the compiler. Essentially, directives are command line options specified within the input file instead of on the command line. Unlike command line options, directives have no default setting. To invoke a directive, you must either toggle it on or set a desired value for its level. Directives placed on the first line of the input file are called global directives. The compiler interprets them as if they appeared at the top of each program unit in the file. Use global directives to ensure that the program is compiled with the correct command line options. Directives appearing anywhere else in the file apply only until the end of the current program unit. The compiler resets the value of the directive to the global value at the start of the next program unit. (Set the global value using a command line option or a global directive.) Some command line options act like global directives. Other command line options override directives. Many directives have corresponding command line options. If you specify conflicting settings in the command line and a directive, the compiler chooses the most restrictive setting. For PPPPaaaaggggeeee 5555 MMMMPPPP((((3333FFFF)))) MMMMPPPP((((3333FFFF)))) Boolean options, if either the directive or the command line has the option turned off, it is considered off. For options that require a numeric value, the compiler uses the minimum of the command line setting and the directive setting. The Fortran compiler accepts directives that cause it to generate code that can be run in parallel. The compiler directives look like Fortran comments: they begin with a C in column one. If multiprocessing is not turned on, these statements are treated as comments. This allows the identical source to be compiled with a single-processing compiler or by Fortran without the multiprocessing option. The directives are distinguished by having a $ as the second character. The following directives are supported: CCCC$$$$DDDDOOOOAAAACCCCRRRROOOOSSSSSSSS,,,, CCCC$$$$&&&&,,,, CCCC$$$$,,,, CCCC$$$$MMMMPPPP____SSSSCCCCHHHHEEEEDDDDTTTTYYYYPPPPEEEE,,,, CCCC$$$$CCCCHHHHUUUUNNNNKKKK,,,, and CCCC$$$$CCCCOOOOPPPPYYYYIIIINNNN.... CCCC$$$$DDDDOOOOAAAACCCCRRRROOOOSSSSSSSS The essential compiler directive for multiprocessing is CCCC$$$$DDDDOOOOAAAACCCCRRRROOOOSSSSSSSS.... This directive directs the compiler to generate special code to run iterations of a DO loop in parallel. The CCCC$$$$DDDDOOOOAAAACCCCRRRROOOOSSSSSSSS directive applies only to the next statement (which must be a DO loop). The Fortran compiler does not support direct nesting of CCCC$$$$DDDDOOOOAAAACCCCRRRROOOOSSSSSSSS loops. The CCCC$$$$DDDDOOOOAAAACCCCRRRROOOOSSSSSSSS directive has the form C$DOACROSS [clause [ [,] clause ...] where valid values for the optional clause are [IF (logical_expression)] [{LOCAL | PRIVATE} (item[,item ...])] [{SHARE | SHARED} (item[,item ...])] [{LASTLOCAL | LAST LOCAL} (item[,item ...])] [REDUCTION (item[,item ...])] [MP_SCHEDTYPE=mode ] [CHUNK=integer_expression] The preferred form of the directive uses the optional commas between clauses. This section discusses the meaning of each clause. IIIIFFFF PPPPaaaaggggeeee 6666 MMMMPPPP((((3333FFFF)))) MMMMPPPP((((3333FFFF)))) The IIIIFFFF clause determines whether the loop is actually executed in parallel. 
LOCAL, SHARE, LASTLOCAL
These clauses specify lists of variables used within parallel loops.  A variable can appear in only one of these lists.  To make the task of writing these lists easier, there are several defaults.  The loop-iteration variable is LASTLOCAL by default.  All other variables are SHARE by default.

LOCAL      Specifies variables that are local to each process.  If a variable is declared as LOCAL, each iteration of the loop is given its own uninitialized copy of the variable.  You can declare a variable as LOCAL if its value does not depend on any other iteration of the loop and if its value is used only within a single iteration.  In effect the LOCAL variable is just temporary; a new copy can be created in each loop iteration without changing the final answer.  The name LOCAL is preferred over PRIVATE.

SHARE      Specifies variables that are shared across all processes.  If a variable is declared as SHARE, all iterations of the loop use the same copy of the variable.  You can declare a variable as SHARE if it is only read (not written) within the loop or if it is an array where each iteration of the loop uses a different element of the array.  The name SHARE is preferred over SHARED.

LASTLOCAL  Specifies variables that are local to each process.  Unlike with the LOCAL clause, the compiler saves only the value of the logically last iteration of the loop when it exits.  The name LASTLOCAL is preferred over LAST LOCAL.

LOCAL is a little faster than LASTLOCAL, so if you do not need the final value, it is good practice to put the DO loop index variable into the LOCAL list, although this is not required.

Only variables can appear in these lists.  In particular, COMMON blocks cannot appear in a LOCAL list.  The SHARE, LOCAL, and LASTLOCAL lists give only the names of the variables.  If any member of the list is an array, it is listed without any subscripts.

REDUCTION
The REDUCTION clause specifies variables involved in a reduction operation.  In a reduction operation, the compiler keeps local copies of the variables and combines them when it exits the loop.  An element of the REDUCTION list must be an individual variable (also called a scalar variable) and cannot be an array.  However, it can be an individual element of an array.  In a REDUCTION clause, it would appear in the list with the proper subscripts.

One element of an array can be used in a reduction operation, while other elements of the array are used in other ways.  To allow for this, if an element of an array appears in the REDUCTION list, the entire array can also appear in the SHARE list.

The four types of reductions supported are sum(+), product(*), min(), and max().  Note that min(max) reductions must use the min(max) intrinsic functions to be recognized correctly.

The compiler confirms that the reduction expression is legal by making some simple checks.  The compiler does not, however, check all statements in the DO loop for illegal reductions.  You must ensure that the reduction variable is used correctly in a reduction operation.
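A short sketch (the array and scalar names are assumed): a sum reduction and a max reduction in one loop, with the max written through the intrinsic as required:

     c$doacross local(i), share(a, n), reduction(s, big)
           do i = 1, n
              s = s + a(i)
              big = max(big, a(i))
           enddo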
CHUNK, MP_SCHEDTYPE
The CHUNK and MP_SCHEDTYPE clauses affect the way the compiler schedules work among the participating tasks in a loop.  These clauses do not affect the correctness of the loop.  They are useful for tuning the performance of critical loops.

For the MP_SCHEDTYPE=mode clause, mode can be one of the following:

     [SIMPLE | STATIC]
     [DYNAMIC]
     [INTERLEAVE | INTERLEAVED]
     [GUIDED | GSS]
     [RUNTIME]

You can use any or all of these modes in a single program.  The CHUNK clause is valid only with the DYNAMIC and INTERLEAVE modes.  SIMPLE, DYNAMIC, INTERLEAVE, GSS, and RUNTIME are the preferred names for each mode.

The simple method (MP_SCHEDTYPE=SIMPLE) divides the iterations among processes by dividing them into contiguous pieces and assigning one piece to each process.

In dynamic scheduling (MP_SCHEDTYPE=DYNAMIC) the iterations are broken into pieces, the size of which is specified with the CHUNK clause.  As each process finishes a piece, it enters a critical section to grab the next available piece.  This gives good load balancing at the price of higher overhead.

The interleave method (MP_SCHEDTYPE=INTERLEAVE) breaks the iterations into pieces of the size specified by the CHUNK option, and execution of those pieces is interleaved among the processes.

The fourth method is a variation of the guided self-scheduling algorithm (MP_SCHEDTYPE=GSS).  Here, the piece size is varied depending on the number of iterations remaining.  By parceling out relatively large pieces to start with and relatively small pieces toward the end, the system can achieve good load balancing while reducing the number of entries into the critical section.

In addition to these four methods, you can specify the scheduling method at run time (MP_SCHEDTYPE=RUNTIME).  Here, the scheduling routine examines values in your run-time environment and uses that information to select one of the other four methods.

If both the MP_SCHEDTYPE and CHUNK clauses are omitted, SIMPLE scheduling is assumed.  If MP_SCHEDTYPE is set to INTERLEAVE or DYNAMIC and the CHUNK clause is omitted, CHUNK=1 is assumed.  If MP_SCHEDTYPE is set to one of the other values, CHUNK is ignored.  If the MP_SCHEDTYPE clause is omitted, but CHUNK is set, then MP_SCHEDTYPE=DYNAMIC is assumed.
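For instance, the sketch below (process_row is a hypothetical user routine, and the chunk size is an assumed value) hands out ten iterations at a time to balance a loop whose iterations vary widely in cost:

     c$doacross local(i), share(a, n), mp_schedtype=dynamic, chunk=10
           do i = 1, n
              call process_row(a, i)
           enddo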
C$&
Occasionally, the clauses in the C$DOACROSS directive are longer than one line.  Use the C$& directive to continue the directive onto multiple lines.  For example:

     C$DOACROSS share(ALPHA, BETA, GAMMA, DELTA,
     C$&  EPSILON, OMEGA), LASTLOCAL(I, J, K, L, M, N),
     C$&  LOCAL(XXX1, XXX2, XXX3, XXX4, XXX5, XXX6, XXX7,
     C$&  XXX8, XXX9)

C$
The C$ directive is considered a comment line except when multiprocessing.  A line beginning with C$ is treated as a conditionally compiled Fortran statement.  The rest of the line contains a standard Fortran statement.  The statement is compiled only if multiprocessing is turned on.  In this case, the C and $ are treated as if they are blanks.  They can be used to insert debugging statements, or an experienced user can use them to insert arbitrary code into the multiprocessed version.

C$MP_SCHEDTYPE
The C$MP_SCHEDTYPE=mode directive acts as an implicit MP_SCHEDTYPE clause for all C$DOACROSS directives in scope.  mode is any of the modes listed under CHUNK and MP_SCHEDTYPE.  A C$DOACROSS directive that does not have an explicit MP_SCHEDTYPE clause is given the value specified in the last such directive prior to the loop, rather than the normal default.  If the C$DOACROSS does have an explicit clause, then the explicit value is used.

C$CHUNK
The C$CHUNK=integer_expression directive affects the CHUNK clause of a C$DOACROSS in the same way that the C$MP_SCHEDTYPE directive affects the MP_SCHEDTYPE clause for all C$DOACROSS directives in scope.  Both directives are in effect from the place they occur in the source until another corresponding directive is encountered or the end of the procedure is reached.

C$COPYIN
It is occasionally desirable to be able to copy values from the master thread's version of the COMMON block into the slave thread's version.  The special directive C$COPYIN allows this.  It has the form

     C$COPYIN item [, item ...]

Each item must be a member of a local COMMON block.  It can be a variable, an array, an individual element of an array, or the entire COMMON block.

Note: The C$COPYIN directive cannot be executed from inside a parallel region.
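A minimal sketch follows; the common block /params/ and its contents are assumptions for illustration, and it is assumed the block has been made process-local by the usual means so that each slave has its own copy.  The directive copies the master's current values into every slave's copy before the next parallel construct uses them:

           common /params/ scale, table(100)
     c     Assumes /params/ is a process-local COMMON block.
     C$COPYIN scale, table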
OpenMP Support
The -mp flag enables the processing of the parallel (MP) directives, including the original SGI/PCF directives (described below) as well as the OpenMP directives.  To disable one or the other set, use -MP:old_mp=OFF or -MP:open_mp=OFF.  See the -MP option control group.

For more information about OpenMP support in MIPSpro Fortran 77, please refer to the MIPSpro Fortran 77 Programmer's Guide.  For more information about OpenMP support in MIPSpro Fortran 90, please refer to the MIPSPro 7 Fortran 90 Commands and Directives Reference Manual.  For general information about OpenMP please refer to the following web page:

     http://www.openmp.org/

PCF Directives
In addition to the simple loop-level parallelism offered by C$DOACROSS and the other directives described above, the compiler supports a more general model of parallelism.  This model is based on the work done by the Parallel Computing Forum (PCF), which itself formed the basis for the proposed ANSI-X3H5 standard.  The compiler supports this model through compiler directives, rather than extensions to the source language.  For more information about PCF, please refer to Chapter 5 of the MIPSpro Fortran 77 Programmer's Guide.

The directives can be used in Fortran 77 programs when compiled with the -mp option.

C$PAR BARRIER
     Ensures that each process waits until all processes reach the barrier before proceeding.

C$PAR [END] CRITICAL SECTION
     Ensures that the enclosed block of code is executed by only one process at a time by using a lock variable.

C$PAR [END] PARALLEL
     Encloses a parallel region, which includes work-sharing constructs and critical sections.

C$PAR PARALLEL DO
     Precedes a single DO loop for which separate iterations are executed by different processes.  This directive is equivalent to the C$DOACROSS directive.

C$PAR [END] PDO
     Separate iterations of the enclosed loop are executed by different processes.  This directive must be inside a parallel region.

C$PAR [END] PSECTION[S]
     Parcels out each block of code in turn to a process.

C$PAR SECTION
     Signifies a starting line for an individual section within a parallel section.

C$PAR [END] SINGLE PROCESS
     Ensures that the enclosed block of code is executed by exactly one process.

C$PAR &
     Continues a PCF directive onto multiple lines.

Parallel Region
A parallel region encloses any number of PCF constructs.  It signifies the boundary within which slave threads execute.  A user program can contain any number of parallel regions.  The syntax of the parallel region is:

     C$PAR PARALLEL [clause [[,] clause]...]
           code
     C$PAR END PARALLEL

where valid clauses are:

     [IF ( logical_expression )]
     [{LOCAL | PRIVATE}(item [,item ...])]
     [{SHARE | SHARED}(item [,item ...])]

The IF, LOCAL, and SHARED clauses have the same meaning as in the C$DOACROSS directive.  The preferred form of the directive has no commas between the clauses.  The SHARED clause is preferred over SHARE and LOCAL is preferred over PRIVATE.
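A minimal sketch of a parallel region (the array and the routine init_part are hypothetical, used only for illustration): every thread executes the enclosed code once, each working on its own portion.

     C$PAR PARALLEL LOCAL(me) SHARED(a, n)
           me = mp_my_threadnum()
           call init_part(a, n, me)
     C$PAR END PARALLEL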
PCF Constructs
The three types of PCF constructs are work-sharing constructs, critical sections, and barriers.  All master and slave threads synchronize at the bottom of a work-sharing construct.  None of the threads continue past the end of the construct until they all have completed execution within that construct.

The four work-sharing constructs are: parallel DO, PDO, sections, and single process.  If specified, these constructs (except for the parallel DO construct) must appear inside of a parallel region.  Specifying a parallel DO construct inside of a parallel region produces a syntax error.

The critical section construct protects a block of code with a lock so that it is executed by only one thread at a time.  Threads do not synchronize at the bottom of a critical section.

The barrier construct ensures that each process that is executing waits until all others reach the barrier before proceeding.

Parallel DO
The parallel DO construct is the same as the C$DOACROSS directive and conceptually the same as a parallel region containing exactly one PDO construct and no other code.  Each thread inside the enclosing parallel region executes separate iterations of the loop within the parallel DO construct.  The syntax of the parallel DO construct is

     C$PAR PARALLEL DO [clause [[,] clause]...]

where clause is defined the same as for C$DOACROSS.  For the C$PAR PARALLEL DO directive, MP_SCHEDTYPE= is optional; you can just specify mode.

PDO
Each thread inside the enclosing parallel region executes a separate iteration of the loop within the PDO construct.  The syntax of the PDO construct, which can only be specified within a parallel region, is:

     C$PAR PDO [clause [[,] clause]...]
           code
     [C$PAR END PDO [NOWAIT]]

where valid values for clause are

     [{LOCAL | PRIVATE} (item[,item ...])]
     [{LASTLOCAL | LAST LOCAL} (item[,item ...])]
     [(ORDERED)]
     [ sched ]
     [ chunk ]

LOCAL, LASTLOCAL, sched, and chunk have the same meaning as in the C$DOACROSS directive.  Note in particular that it is legal to declare a data item as LOCAL in a PDO even if it was declared as SHARED in the enclosing parallel region.  The (ORDERED) clause is equivalent to a sched clause of DYNAMIC and a chunk clause of 1.  The parentheses are required.  LASTLOCAL is preferred over LAST LOCAL and LOCAL is preferred over PRIVATE.

The END PDO directive is optional.  If specified, this directive must appear immediately after the end of the DO loop.  The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive.  If you do not specify NOWAIT, the processes will wait until all have reached the directive before proceeding.
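A minimal sketch (the array name is assumed): the iterations of the j loop are divided among the threads of the enclosing parallel region, and all threads wait at END PDO before continuing.

     C$PAR PARALLEL LOCAL(j) SHARED(b, n)
     C$PAR PDO
           do j = 1, n
              b(j) = 2.0 * b(j)
           enddo
     C$PAR END PDO
     C$PAR END PARALLEL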
Parallel Sections
The parallel sections construct is a parallel version of the Fortran 90 SELECT statement.  Each block of code is parceled out in turn to a separate thread.  The syntax of the parallel sections construct is

     C$PAR PSECTION[S] [clause [[,] clause]...]
           code
     [C$PAR SECTION
           code] ...
     C$PAR END PSECTION[S] [NOWAIT]

where the only valid value for clause is

     [{LOCAL | PRIVATE} (item [,item])]

LOCAL is preferred over PRIVATE and has the same meaning as for the C$DOACROSS directive.  Note in particular that it is legal to declare a data item as LOCAL in a parallel sections construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive.  If you do not specify NOWAIT, the processes will wait until all have reached the END PSECTION directive before proceeding.

Parallel sections must appear within a parallel region.  They can contain critical section constructs but cannot contain any of the following types of constructs: PDO, parallel DO, C$DOACROSS, or single process.

The sections within a parallel sections construct are assigned to threads one at a time, from the top down.  There is no other implied ordering to the operations within the sections.  In particular, a later section cannot depend on the results of an earlier section, unless some form of explicit synchronization is used.  If there is such explicit synchronization, you must be sure that the lexical ordering of the blocks is a legal order of execution.

Single Process
The single process construct, which can only be specified within a parallel region, ensures that a block of code is executed by exactly one process.  The syntax of the single process construct is

     C$PAR SINGLE PROCESS [clause [[,] clause]...]
           code
     C$PAR END SINGLE PROCESS [NOWAIT]

where the only valid value for clause is

     [{LOCAL | PRIVATE} (item [,item])]

LOCAL is preferred over PRIVATE and has the same meaning as for the C$DOACROSS directive.  Note in particular that it is legal to declare a data item as LOCAL in a single process construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive.  If you do not specify NOWAIT, the processes will wait until all have reached the directive before proceeding.

This construct is semantically equivalent to a parallel sections construct with only one section.  The single process construct provides a more descriptive syntax.
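A minimal sketch (the unit number and variables are assumed): only one thread performs the read, the others wait at the end of the construct, and all then proceed with the data visible in the shared variables.

     C$PAR PARALLEL LOCAL(i) SHARED(x, n)
     C$PAR SINGLE PROCESS
           read (5, *) n, (x(i), i = 1, n)
     C$PAR END SINGLE PROCESS
     C$PAR END PARALLEL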
Critical Section
The critical section construct restricts execution of a block of code so that only one process can execute it at a time.  Another process attempting to gain entry to the critical section must wait until the previous process has exited.

The critical section construct can appear anywhere in a program, including inside and outside a parallel region and within a C$DOACROSS loop.  The syntax of the critical section construct is

     C$PAR CRITICAL SECTION [ ( lock_variable ) ]
           code
     C$PAR END CRITICAL SECTION

The lock_variable is an optional integer variable that must be initialized to zero.  The parentheses are required.  If you do not specify lock_variable, the compiler automatically supplies one.  Multiple critical section constructs inside the same parallel region are considered to be independent of each other unless they use the same explicit lock_variable.

Barrier Constructs
A barrier construct ensures that each process waits until all processes reach the barrier before proceeding.  The syntax of the barrier construct is

     C$PAR BARRIER

C$PAR &
Occasionally, the clauses in PCF directives are longer than one line.  You can use the C$PAR & directive to continue a directive onto multiple lines.  For example:

     C$PAR PARALLEL local(i,j)
     C$PAR& shared(a,n,index_x,index_y,cur_max,
     C$PAR& big_max,bmax_x,bmax_y)

Restrictions
The three work-sharing constructs, PDO, PSECTION, and SINGLE PROCESS, must be executed by all the threads executing in the parallel region (or none of the threads).  The following is illegal:

     C$PAR PARALLEL
           if (mp_my_threadnum() .gt. 5) then
     C$PAR SINGLE PROCESS
              many_processes = .true.
     C$PAR END SINGLE PROCESS
           endif

This code will hang forever when run with enough processes.  One or more processes will be stuck at the C$PAR END SINGLE PROCESS directive waiting for all the threads to arrive.  Because some of the threads never took the appropriate branch, they will never encounter the construct.  However, the following kind of simple looping is supported:

           code
     C$PAR PARALLEL local(i,j) shared(a)
           do i = 1, n
     C$PAR PDO
              do j = 2, n
                 code

The distinction here is that all of the threads encounter the work-sharing construct, they all complete it, and they all loop around and encounter it again.

Note that this restriction does not apply to the critical section construct, which operates on one thread at a time without regard to any other threads.

Parallel regions cannot be lexically nested inside of other parallel regions, nor can work-sharing constructs be nested.  However, as an aid to writing library code, you can call an external routine that contains a parallel region even from within a parallel region.  In this case, only the first region is actually run in parallel.  Therefore, you can create a parallelized routine without accounting for whether it will be called from within an already parallelized routine.

New Directives for Tuning on Origin2000
The Origin2000 provides cache-coherent, shared memory in the hardware.  Memory is physically distributed across processors.
Consequently, references to locations in the remote memory of another processor take substantially longer (by a factor of two or more) to complete than references to locations in local memory.  This can severely affect the performance of programs that suffer from a large number of cache misses.

The programming support consists of extensions to the existing Multiprocessing Fortran directives (pragmas).  The table below summarizes the new directives.  Like the other Multiprocessing Fortran directives, these new directives are ignored except under multiprocessor -mp compilation.

Summary of New Directives:

     Directive                                        Description
     c$distribute A (dist, dist, ...)                 Data distribution
     c$dynamic A                                      Redistributable annotation
     c$distribute_reshape B (dist)                    Data distribution with reshaping
     c$redistribute A (dist, dist)                    Dynamic data redistribution
     c$doacross nest (i,j)                            Nested doacross
     c$doacross affinity (i) = data (A(i))            Data-affinity scheduling
     c$doacross affinity (i) = thread (expr)          Thread-affinity scheduling
     c$page_place (addr, sz, thread)                  Explicit placement of data

Data Distribution Directives
The data distribution directives allow you to specify High Performance Fortran-like distributions for array data structures.  For irregular data structures, directives are provided to explicitly place data directly on a specific processor.

The c$distribute, c$dynamic, and c$distribute_reshape directives are declarations that must be specified in the declaration part of the program, along with the array declaration.  The c$redistribute directive is an executable statement and can appear in any executable portion of the program.

You can specify a data distribution directive for any local, global, or common-block array.  Each dimension of a multi-dimensional array can be independently distributed.  The possible distribution types for an array dimension are BLOCK, CYCLIC (expr), and * (asterisk, meaning not distributed).  (A CYCLIC distribution with a chunk size that is either greater than 1 or is determined at runtime is sometimes also called BLOCK-CYCLIC.)

A BLOCK distribution partitions the elements of the dimension of size N into P blocks (one per processor), with each block of size B = ceiling(N/P).  A CYCLIC(k) distribution partitions the elements of the dimension into pieces of size k each and distributes them sequentially across the processors.

A distributed array is distributed across all the processors being used in that particular execution of the program, as determined by the environment variable MP_SET_NUMTHREADS.  If a distributed array is distributed in more than one dimension, then by default the processors are apportioned as equally as possible across each distributed dimension.  For instance, if an array has two distributed dimensions, then an execution with 16 processors will assign 4 processors to each dimension (4 x 4 = 16), whereas an execution with 8 processors will assign 4 processors to the first dimension and 2 processors to the second dimension.
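Concretely, the declarations look like the following minimal sketch (array shapes assumed); the second array has two distributed dimensions and would be apportioned across the processors as just described:

           real*8 x(10000), a(200, 200)
     c$distribute x(block)
     c$distribute a(block, cyclic(1))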
You can override this default and explicitly control the number of processors in each dimension using the ONTO clause with a data distribution directive.

Nested Doacross Directive
The nested doacross directive allows you to exploit nested concurrency in a limited manner.  Although true nested parallelism is not supported, you can exploit parallelism across iterations of a perfectly nested loop-nest.  For example:

     c$doacross nest(i, j)
           do i =
              do j =
                 code
              enddo
           enddo

This directive specifies that the entire set of iterations across the (i, j) loops can be executed concurrently.  The restriction is that the do-i and do-j loops must be perfectly nested, that is, no code is allowed between either the do-i and do-j statements or the enddo-i and enddo-j statements.  You can also supply the nest clause with the PCF pdo directive.

The existing clauses such as local and shared behave as before.  You can combine a nested doacross with an affinity clause (as shown below), or with a schedtype of simple or interleaved (dynamic and gss are not currently supported).  The default is simple scheduling, except when accessing reshaped arrays (see Affinity Scheduling).

Affinity Scheduling
The goal of affinity scheduling is to control the mapping of iterations of a parallel loop for execution onto the underlying threads.  Specify affinity scheduling with an additional clause to a c$doacross directive.  An affinity clause, if supplied, overrides the SCHEDTYPE clause.

Data Affinity
The following code shows an example of data affinity:

     c$distribute A(block)
     c$doacross affinity(i) = data(A(a*i+b))
           do i = 1, N
              ...
           enddo

The a and b must be literal integer constants with a greater than zero.  The effect of this clause is to distribute the iterations of the parallel loop to match the data distribution specified for the array A, such that iteration i is executed on the processor that owns element A(a*i+b), based on the distribution for A.  In case of a multi-dimensional array, affinity is provided for the dimension that contains the loop-index variable.  The loop-index variable cannot appear in more than one dimension in an affinity directive.  For example:

     c$distribute A (block, cyclic(1))
     c$doacross affinity (i) = data (A(i+3, j))
           do i
              ...
           enddo

In this example, the loop is scheduled based on the block-distribution of the first dimension.  The affinity clause is also available with the PCF pdo directive.

The default schedtype for parallel loops is SIMPLE.  However, under -O3 compilation, loops that reference reshaped arrays default to affinity scheduling for the most frequently accessed reshaped array in the loop (chosen heuristically by the compiler).  To obtain SIMPLE scheduling even at -O3, you can explicitly specify the schedtype on the parallel loop.

Data affinity for loops with non-unit stride can sometimes result in non-linear affinity expressions.  In such situations the compiler issues a warning, ignores the affinity clause, and defaults to simple scheduling.
Data Affinity for Redistributed Arrays
By default, the compiler assumes that a distributed array is not dynamically redistributed, and directly schedules a parallel loop for the specified data affinity.  In contrast, a redistributed array can have multiple possible distributions, and data affinity for a redistributed array must be implemented in the run-time system based on the particular distribution.

However, the compiler does not know whether or not an array is redistributed, since the array may be redistributed in another function (possibly even in another file).  Therefore, you must explicitly specify the c$dynamic declaration for redistributed arrays.  You must supply this directive only in those functions that contain a c$doacross loop with data affinity for that array.  This informs the compiler that the array can be dynamically redistributed.  Data affinity for such arrays is implemented through a run-time lookup.

Implementing data affinity through a run-time lookup incurs some extra overhead compared to a direct compile-time implementation.  You can avoid this overhead in situations where a subroutine contains data affinity for a redistributed array, and you know the distribution of the array for the entire duration of that subroutine.  In this situation, you can supply the c$distribute directive with the particular distribution, and omit the c$dynamic directive.

By default, the compiler assumes that a distributed array is not redistributed at runtime.  As a result, the distribution is known at compile time, and data affinity for the array can be implemented directly by the compiler.  In contrast, since a redistributed array can have multiple possible distributions at runtime, data affinity for a redistributed array is implemented in the run-time system based on the distribution at runtime, incurring extra run-time overhead.

If an array is redistributed in the program, then you can explicitly specify a c$dynamic directive for that array.  The only effect of the c$dynamic directive is to implement data affinity for that array at runtime rather than at compile time.  If you know an array has a specified distribution throughout the duration of a subroutine, then you do not have to supply the c$dynamic directive.  The result is more efficient compile-time affinity scheduling.

Since reshaped arrays cannot be dynamically redistributed, this is an issue only for regular data distribution.

Data Affinity for a Formal Parameter
You can supply a c$distribute directive on a formal parameter, thereby specifying the distribution on the incoming actual parameter.  If different calls to the subroutine have parameters with different distributions, then you can omit the c$distribute directive on the formal parameter; data affinity loops in that subroutine are automatically implemented through a run-time lookup of the distribution.  (This is permissible only for regular data distribution.  For reshaped array parameters, the distribution must be fully specified on the formal parameter.)
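A minimal sketch (the routine name and shapes are assumptions for illustration): the formal parameter carries the distribution of the incoming actual argument, so the affinity loop in the subroutine can be scheduled at compile time.

           subroutine scale_it(a, n)
           integer n
           real*8 a(n)
     c$distribute a(block)
     c$doacross affinity(i) = data(a(i))
           do i = 1, n
              a(i) = 0.5d0 * a(i)
           enddo
           end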
Thread Affinity
Similar to data affinity, you can specify thread affinity as an additional clause on a c$doacross directive.  The syntax for thread affinity is as follows:

     c$doacross affinity (i) = thread(expr)

The effect of this directive is to execute iteration i on the thread number given by the user-supplied expression (modulo the number of threads).

Specifying Processor Topology With the ONTO Clause
This directive allows you to specify the processor topology when two (or more) dimensions of processors are required.  For instance, if an array is distributed in two dimensions, then you can use the ONTO clause to specify how to partition the processors across the distributed dimensions.  Or, in a nested doacross with two or more nested loops, you can use the ONTO clause to specify the partitioning of processors across the multiple parallel loops.  For example:

     c Assign processors in the ratio 1:2 to the two dimensions
           real*8 A (100, 200)
     c$distribute A (block, block) onto (1, 2)

     c Use 2 processors in the do-i loop, and the remaining in the do-j loop
     c$doacross nest (i, j) onto (2, *)
           do i =
              do j =
                 code
              enddo
           enddo

Types of Data Distribution
There are two types of data distribution: regular and reshaped.  The following sections describe each of these distributions.

Regular Data Distribution
The regular data distribution directives try to achieve the desired distribution solely by influencing the mapping of virtual addresses to physical pages without affecting the layout of the data structure.  Since the granularity of data allocation is a physical page (at least 16 Kbytes), the achieved distribution is limited by the underlying page granularity.  However, the advantages are that regular data distribution directives can be added to an existing program without any restrictions, and can be used for affinity scheduling.

Distributed arrays can be dynamically redistributed with the following redistribute statement:

     c$redistribute A (block, cyclic(k))

The c$redistribute is an executable statement that changes the distribution "permanently" (or until another redistribute statement).  It also affects subsequent affinity scheduling.

The c$dynamic directive specifies that the named array is redistributed in the program, and is useful in controlling affinity scheduling for dynamically redistributed arrays.

Data Distribution With Reshaping
Similar to regular data distribution, the reshape directive specifies the desired distribution of an array.  In addition, however, the reshape directive declares that the program makes no assumptions about the storage layout of that array.  The compiler performs aggressive optimizations for reshaped arrays that violate standard Fortran 77 layout assumptions but guarantee the desired data distribution for that array.
The reshape directive accepts the same distributions as the regular data distribution directive, but uses a different keyword, as shown below:

     c$distribute_reshape A (block, cyclic(1))

Restrictions on Reshaped Arrays
Since the distribute_reshape directive specifies that the program does not depend on the storage layout of the reshaped array, restrictions on the arrays that can be reshaped include the following:

     The distribution of a reshaped array cannot be changed.

     Initialized data cannot be reshaped.

     Arrays that are explicitly allocated through alloca/malloc and accessed through pointers cannot be reshaped.

     An array that is equivalenced to another array cannot be reshaped.

     I/O for a reshaped array cannot be mixed with namelist I/O or a function call in the same I/O statement.

     A COMMON block containing a reshaped array cannot be linked -Xlocal.  Caution: this user error is not caught by the compiler/linker.

If a reshaped array is passed as an actual parameter to a subroutine, two possible scenarios exist:

     The array is passed in its entirety (call func(A) passes the entire array A, whereas call func(A(i,j)) passes a portion of A).  The compiler automatically clones a copy of the called subroutine and compiles it for the incoming distribution.  The actual and formal parameters must match in the number of dimensions and the size of each dimension.  You can restrict a subroutine to accept a particular reshaped distribution on a parameter by specifying a distribute_reshape directive on the formal parameter within the subroutine.  All calls to this subroutine with a mismatched distribution will lead to compile- or link-time errors.  (A sketch of this case appears below, after the error-detection summary.)

     A portion of the array can be passed as a parameter, but the callee must access only a single processor's portion.  If the callee exceeds a single processor's portion, then the results are undefined.  You can use intrinsics to access details about the array distribution.

Error-Detection Support
Most errors in accessing reshaped arrays are caught either at compile time or at link time.  These include:

     Inconsistencies in reshaped arrays across COMMON blocks (including across files)

     Declaring a reshaped array EQUIVALENCED to another array

     Inconsistencies in reshaped distributions on actual and formal parameters

     Other errors such as disallowed I/O statements involving reshaped arrays, reshaping initialized data, or reshaping dynamically allocated data

Errors such as mismatching the declared size of an array dimension typically are caught only at runtime.  The compiler option -MP:check_reshape=on generates code to perform these tests at runtime.  These run-time checks are not generated by default, since they incur overhead, but are useful during debugging.  The runtime checks include:

     Inconsistencies in array-bound declarations on each actual and formal parameter

     Inconsistencies in declared bounds of a formal parameter that corresponds to a portion of a reshaped actual parameter
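The following minimal sketch (names and shapes assumed) illustrates the first parameter-passing case above: the reshaped array is passed in its entirety, and the callee pins the expected distribution on its formal parameter so that a mismatched call is reported at compile or link time.

           program main
           real*8 a(100, 100)
     c$distribute_reshape a(block, *)
           call smooth(a)
           end

           subroutine smooth(b)
           real*8 b(100, 100)
     c$distribute_reshape b(block, *)
           integer i, j
           do j = 1, 100
              do i = 1, 100
                 b(i, j) = 0.0d0
              enddo
           enddo
           end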
Query Intrinsics for Distributed Arrays
You can use the following set of intrinsics to obtain information about an individual dimension of a distributed array.  Fortran array dimensions are numbered starting at 1.  All routines work with 64-bit integers as shown below, and return -1 in case of an error (except dsm_this_startingindex, where -1 may be a legal return value).

i8 = dsm_numthreads (A, j8)
     Called with a distributed array and a dimension number.  Returns the number of threads in that dimension.

i8 = dsm_chunksize (A, j8)
     Returns the chunk size (ignoring partial chunks) in the given dimension for each of block, cyclic(..), and star distributions.

i8 = dsm_this_chunksize (A, j8, k8)
     Returns the chunk size for the chunk containing the given index value for each of block, cyclic(..), and star.  This value may be different from dsm_chunksize due to edge effects that may lead to a partial chunk.

i8 = dsm_rem_chunksize (A, j8, k8)
     Returns the remaining chunk size from index to the end of the current chunk, inclusive of each end point.  Essentially it is the distance from index to the end of that contiguous block, inclusive.

i8 = dsm_this_startingindex (A, j8, k8)
     Returns the starting index value of the chunk containing the supplied index.

i8 = dsm_numchunks (A, j8)
     Returns the number of chunks (including partial chunks) in the given dimension for each of block, cyclic(..), and star distributions.

i8 = dsm_this_threadnum (A, j8, k8)
     Returns the thread number for the chunk containing the given index value for each of block, cyclic(..), and star distributions.

i8 = dsm_distribution_block (A, j8)
i8 = dsm_distribution_cyclic (A, j8)
i8 = dsm_distribution_star (A, j8)
     Boolean routines to query the distribution of a given dimension.

i8 = dsm_isreshaped (A)
     Boolean routine to query whether A is reshaped or not.

i8 = dsm_isdistributed (A)
     Boolean routine to query whether A is distributed (regular or reshaped) or not.
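For example, a minimal sketch (the array, its distribution, and the declarations of the intrinsics are assumed for illustration) that queries how the first dimension of a distributed array was laid out; note the 64-bit integer arguments and results mentioned above:

           real*8 a(1000)
     c$distribute a(block)
           integer*8 dim, nthd, csize
           integer*8 dsm_numthreads, dsm_chunksize
           dim = 1
           nthd = dsm_numthreads(a, dim)
           csize = dsm_chunksize(a, dim)
           print *, 'threads =', nthd, '  chunk =', csize
           end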
Explicit Placement of Data
For irregular data structures, you can explicitly place data in the
physical memory of a particular processor using the following
directive:

     c$page_place (addr, size, threadnum)

where addr is the starting address, size is the size in bytes, and
threadnum is the number of the destination processor.  This directive
causes all the pages spanned by the virtual address range addr to
addr+size to be allocated from the local memory of processor number
threadnum.  It is an executable statement; therefore, you can use it to
place either statically or dynamically allocated data.  An example of
this directive is as follows:

     real*8 a(100)
c$page_place (a, 800, 3)

A further placement sketch appears after the environment-variable list
below.

Implementation Details
Chapter 5 of the MIPSpro Fortran 77 Programmer's Guide describes how
the compiler implements reshaped arrays and BLOCK distribution.  It
also describes the differences between regular and reshaped data
distribution.

Optional Environment Variables and Compile-Time Options
You can control various run-time features through the following
optional environment variables:

_DSM_OFF
     Disables non-uniform memory access (NUMA) specific calls (for
     example, calls to allocate pages from a particular memory).

_DSM_VERBOSE
     Prints messages about parameters being used during execution.

_DSM_PPM
     Specifies the number of processors to use per memory module.  Must
     be set to an integer value; to use only one processor per memory
     module, set this variable to 1.

PAGESIZE_STACK, PAGESIZE_DATA, PAGESIZE_TEXT
     Specify the desired page size in kilobytes.  Must be set to an
     integer value.

_DSM_MIGRATION
     Automatic page migration is OFF by default.  This variable, if
     set, must be set to one of:

     OFF     disables migration entirely (default)
     ON      enables migration except for explicitly placed data (using
             page_place or a data distribution directive)
     ALL_ON  enables migration for ALL data

_DSM_ROUND_ROBIN
     Requests round-robin data allocation across memories, rather than
     first-touch, for all of the stack, data, and text segments.  The
     default is first-touch.

MP_SUGNUMTHD
     If set, this variable enables the use of dynamic threads in the
     multiprocessor (MP) runtime.  With dynamic threads, the MP runtime
     automatically adjusts the number of threads used for a parallel
     loop at runtime based on the overall system load.  This feature
     improves the overall throughput of the system.  Furthermore, by
     avoiding excessive concurrency, it can reduce delays at
     synchronization points within a single application.
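Returning to the page_place directive described above, the following
hedged sketch spreads a larger array over two processors.  The array
name, byte counts, and processor numbers are invented, and the sketch
assumes the directive may be repeated with an array element as the
starting address.

      real*8 b(1000)
c     Elements 1-500 (500 * 8 = 4000 bytes) on processor 0, and
c     elements 501-1000 on processor 1.
c$page_place (b(1),   4000, 0)
c$page_place (b(501), 4000, 1)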
Useful compile-time options
The following options are part of the -MP option control group
supported by f77(1).

-MP:dsm={on, off} (default on)
     All the data-distribution and scheduling features described in
     this man page are enabled by default under -mp compilations.  To
     disable all the DSM-specific directives (e.g. distribution and
     affinity scheduling), compile with -MP:dsm=off.

-MP:clone={on, off} (default on)
     The compiler automatically clones procedures that are called with
     reshaped arrays as parameters for the incoming distribution.
     However, if you have explicitly specified the distribution on all
     relevant formal parameters, then you can disable auto-cloning with
     -MP:clone=off.  The consistency checking of the distribution
     between actual and formal parameters is not affected by this flag,
     and is always enabled.

-MP:check_reshape={on, off} (default off)
     This flag enables generation of the runtime consistency checks
     across procedure boundaries when passing reshaped arrays (or
     portions thereof) as parameters.

-MP:old_mp={on, off}
     The -mp flag enables the processing of the parallel (MP)
     directives, including the original SGI/PCF directives as well as
     the OpenMP directives.  When set to off, this flag disables the
     processing of the original SGI/PCF directives but retains the
     processing of the OpenMP directives.

-MP:open_mp={on, off}
     The -mp flag enables the processing of the parallel (MP)
     directives, including the original SGI/PCF directives as well as
     the OpenMP directives.  When set to off, this flag disables the
     processing of the OpenMP directives but retains the processing of
     the original SGI/PCF directives.

SEE ALSO
     f77(1), sync(3f),
     MIPSpro Fortran 77 Programmer's Guide,
     MIPSpro Power Fortran 77 Programmer's Guide